Scaling SGD Batch Size to 32K for ImageNet Training

نویسندگان

Yang You

Igor Gitman

Boris Ginsburg

چکیده

The most natural way to speed-up the training of large networks is to use dataparallelism on multiple GPUs. To scale Stochastic Gradient (SG) based methods to more processors, one need to increase the batch size to make full use of the computational power of each GPU. However, keeping the accuracy of network with increase of batch size is not trivial. Currently, the state-of-the art method is to increase Learning Rate (LR) proportional to the batch size, and use special learning rate with "warm-up" policy to overcome initial optimization difficulty. By controlling the LR during the training process, one can efficiently use largebatch in ImageNet training. For example, Batch-1024 for AlexNet and Batch-8192 for ResNet-50 are successful applications. However, for ImageNet-1k training, state-of-the-art AlexNet only scales the batch size to 1024 and ResNet50 only scales it to 8192. The reason is that we can not scale the learning rate to a large value. To enable large-batch training to general networks or datasets, we propose Layer-wise Adaptive Rate Scaling (LARS). LARS LR uses different LRs for different layers based on the norm of the weights (||w||) and the norm of the gradients (||∇w||). By using LARS algoirithm, we can scale the batch size to 32768 for ResNet50 and 8192 for AlexNet. Large batch can make full use of the system’s computational power. For example, batch-4096 can achieve 3× speedup over batch-512 for ImageNet training by AlexNet model on a DGX-1 station (8 P100 GPUs).

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes

We demonstrate that training ResNet-50 on ImageNet for 90 epochs can be achieved in 15 minutes with 1024 Tesla P100 GPUs. This was made possible by using a large minibatch size of 32k. To maintain accuracy with this large minibatch size, we employed several techniques such as RMSprop warm-up, batch normalization without moving averages, and a slow-start learning rate schedule. This paper also d...

متن کامل

ImageNet Training in 24 Minutes

Finishing 90-epoch ImageNet-1k training with ResNet-50 on a NVIDIA M40 GPU takes 14 days. This training requires 10 single precision operations in total. On the other hand, the world’s current fastest supercomputer can finish 2× 10 single precision operations per second (Dongarra et al. 2017). If we can make full use of the supercomputer for DNN training, we should be able to finish the 90-epoc...

متن کامل

Don't Decay the Learning Rate, Increase the Batch Size

It is common practice to decay the learning rate. Here we show one can usually obtain the same learning curve on both training and test sets by instead increasing the batch size during training. This procedure is successful for stochastic gradient descent (SGD), SGD with momentum, Nesterov momentum, and Adam. It reaches equivalent test accuracies after the same number of training epochs, but wi...

متن کامل

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

Deep learning thrives with large neural networks and large datasets. However, larger networks and larger datasets result in longer training times that impede research and development progress. Distributed synchronous SGD offers a potential solution to this problem by dividing SGD minibatches over a pool of parallel workers. Yet to make this scheme efficient, the per-worker workload must be larg...

متن کامل

How to scale distributed deep learning?

Training time on large datasets for deep neural networks is the principal workflow bottleneck in a number of important applications of deep learning, such as object classification and detection in automatic driver assistance systems (ADAS). To minimize training time, the training of a deep neural network must be scaled beyond a single machine to as many machines as possible by distributing the ...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

CoRR

دوره abs/1708.03888 شماره

صفحات -

تاریخ انتشار 2017

Scaling SGD Batch Size to 32K for ImageNet Training

نویسندگان

چکیده

منابع مشابه

Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes

ImageNet Training in 24 Minutes

Don't Decay the Learning Rate, Increase the Batch Size

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

How to scale distributed deep learning?

عنوان ژورنال:

اشتراک گذاری